***************************
* Market Learning Project
* Clean Dartmouth provider crosswalk
* cleanXwalk.do
***************************

//preliminaries
clear all
set more off
ssc install a2reg 
** set the working directory
cd "/disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/replication/"

* what is the terminal year of the constructed crosswalk
* dartmouth file runs through 2016, POS file runs through 2014
global MAXYEAR = 2014

// END USER CONFIG

* create crosswalk from medicare pn to "merged" medicare pn
* we use the dartmouth crosswalk file
* we merge together all provider numbers that have ever corresponded to the same
* institution and consider those pn's together to be one composite pn

** Generates data file from excel file with zip codes - HRR -HSA mapping from 2005 **
** This file is taken straight from publicly available dartmouth atlas except variable names are changed for ease **
clear
import excel Exportable_RawData/ZipHsaHrr05.xls, firstrow
rename zipcode05 zip_master
rename hrrnum hrrnum_new
rename hsanum hsanum_new
tempfile temp_hrr
save `temp_hrr'


* bring in dartmouth data
** NOT PUBLIC AVAILABLE DATA: This is a dataset that is not publicly available
clear
import excel "/disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/mlearn/source_shared/dartmouth-xwalk/HospitalReport2016.xls", firstrow sheet("MATRIX_1992_to_2016")
keep Provider start_* end_*
generate PROVIDER = substr(Provider, 3, 6)
drop Provider
tempfile temp_matrix
save `temp_matrix'

clear
import excel "/disk/agedisk3/medicare.work/chandra-DUA52080/pragya-dua52080/mlearn/source_shared/dartmouth-xwalk/HospitalReport2016.xls", firstrow sheet("PROV6_2016")
keep Provider SSLS Hospital_Name Zipcode HSA HRR
rename Provider PROVIDER
rename Hospital_Name PNAME
rename Zipcode PZIP
** This collapse eliminates one provider (010167) who appears twice; in both appearances this provider has the same info** 
collapse (first) HRR HSA PNAME PZIP SSLS, by(PROVIDER)

** Note this eliminates 10 hospitals that appears only in MATRIX_1992_to_2016 **
** (Providers 030135, 100324, 100325, 193095, 193111, 230304, 280135, 520209, 670104, 670105)
** This also eliminates one provider that appears only in PROV6_2016 **
** (Provider 010167) **
** Of the providers in `temp_matrix', **
** all start in 2014 or 2015 though and some of them have an odd pattern of going from **
** "no" to having an ID to being "closed"  ** 

merge 1:1 PROVIDER using `temp_matrix', keep(match)
keep PROVIDER SSLS start_* end_* HRR HSA PNAME PZIP

* the provider is the starting institution pn
rename PROVIDER pn
rename HRR hrrnum
rename HSA hsanum
rename SSLS ssls
rename PZIP zip_dxwlk

destring ssls zip_dxwlk, replace

* first year pn is active
gen first_year = .
* last year pn is active
gen last_year = .
* year pn closes
gen close_year = .

forvalues year=1993/2016 {
	foreach part in start end {
		* first_year is the year of the first entry that was filled in for the row
		replace first_year = `year' if `part'_`year'!="" & first_year==.
		* close_year is the year where the row says the pn "Closed"
		replace close_year = `year' if `part'_`year'=="Closed"
	}
	* last_year is the last year the row had a value in it
	* because this setting occurs in a year loop, last_year will be set to the last
	* year with a value filled in start_`year' or end_`year'
	replace last_year = `year' if start_`year'!="" | end_`year'!=""
}

* deal with issue that xwalk can run past MAXYEAR

* eliminate facilities that first appear after MAXYEAR
drop if first_year != . & first_year>$MAXYEAR

* eliminate transition matrix component after MAXYEAR
if ($MAXYEAR < 2016) {
	local yaftermax = $MAXYEAR + 1
	forvalues year=`yaftermax'/2016 {
		drop start_`year' end_`year'
	}
}

* if last year > MAXYEAR, it was open at end of last year. set to missing
* so we mimic if we only observed through MAXYEAR
replace last_year = . if last_year > $MAXYEAR
* likewise for close_year
replace close_year = . if close_year > $MAXYEAR

* save data on starting pn's from dartmouth file

preserve

keep pn hrrnum hsanum ssls zip_dxwlk first_year last_year close_year

tempfile temp_pn_dxwlk
save `temp_pn_dxwlk'

restore

* next step: construct pn_new's by finding the overlapping pn's

* the transition matrix includes non-pn values. remove them
foreach var of varlist start_* end_* {
	replace `var' = "" if `var'=="Closed"|`var'=="no"
}

* now we'll reshape long
* fix variable names in preparation for that

* the "employer" is the institution's starting pn
rename pn employer

* the "worker" is the institution's pn at the particular moment (e.g. start of 2004)

* year will be YYYY1 for start of year and YYYY2 for end of year
foreach var of varlist start_* {
	local y = substr("`var'",-4,4)
	rename `var' worker`y'1
}

foreach var of varlist end_* {
	local y = substr("`var'",-4,4)
	rename `var' worker`y'2
}

reshape long worker, i(employer) j(year)
drop if worker==""

* convert year=YYYYS to YYYY (year) and S (start/end of year)
gen startend = mod(year,10)
replace year = floor(year/10)

* to figure out which pn's (workers) share institutions (employers), use the
* mobility group tool

a2group , individual(worker) unit(employer) groupvar(mobgrp)

* the mobility groups are the new "merged" "composite" pn's (the pn_new's)

keep mobgrp worker
duplicates drop

rename mobgrp pn_new
rename worker pn

sort pn

* now that we've made the groups, time to do some sanity checking and cleanup

* bring in data on the pn's from the dartmouth file
merge 1:1 pn using `temp_pn_dxwlk', gen(match_dxwlk)

* bring in data on the pn's from the CMS providers of services file
* for each PN, we use its entry from the last year it existed in the file
** This POS file was obtained from Adam Sacarny and goes through 2014 ** 
merge 1:1 pn using Exportable_RawData/pos_lastyear.dta, gen(match_pos) keep(match master)
* this variable is not useful and too similarly named to last_year
drop lastyear

* 6 pn's are in the dxwlk data (have a row in the xwalk for the institution)
* but never appear to be open. these 6 hospitals all terminated their pn's before 1993
* so they are irrelevant
assert match_pos==3 & termdate!= . & termdate<mdy(1,1,1993) if match_dxwlk==2
drop if match_dxwlk==2

* 337 pn's are in the hospital transition matrix but don't have a row in the dxwlk
* data. this seems to happen when the hospital transitions to a non-shortterm/CAH
* facility. don't care about the converted facility (we're focusing on shortterm/CAH)
* so we can drop those

* indicator for whether the pn is a shortterm/CAH facility based on its pn
* see http://www.resdac.org/sites/resdac.org/files/Provider%20Number%20Table_0.txt
gen factype_pn = real(substr(pn,3,2))
gen shortcah_pn = factype_pn!= . & ( (factype_pn>=0 & factype_pn<=8) | factype_pn==13 )
  
* in our data, a hospital that switches to a non-shortterm/cah facility is "closed"
* remove them
drop if shortcah_pn==0
drop factype_pn shortcah_pn

** There are some hospitals that don't match the pos crosswalk and do not appear in **
** the dartmouth xwalk except in the actual matrix (they do not have their own line) **
** We are dropping these observations because they are likely not important ** 
** Note as a result of this some hospitals will have no name because they appear only in POS ** 

drop if match_dxwlk != 3 & match_pos != 3 

** Generates a master zip code variable relying in dartmouth zip code and pos zip code where dartmouth is unavailable ** 
generate zip_master = zip_dxwlk
replace zip_master = zip if zip_master == . 

** Merges zip code to hrr and hsa using crosswalk from 2005 ** 
** This only gets used for hospitals appearing only in the POS file ** 
merge m:1 zip_master using `temp_hrr', keep(master match)
** Make sure all the hospitals appearing only in the POS file get an HRR & HSA ** 
assert (match_dxwlk != 3 & _merge == 3) | (match_dxwlk == 3)

replace hrrnum = hrrnum_new if match_dxwlk != 3
replace hsanum = hsanum_new if match_dxwlk != 3


** There are some hospitals where dartmouth and POS disagree on CAH status ** 
** We are letting dartmouth take precedence ** 


gen byte cah_new = ssls
replace cah_new = cah if cah_new == . 
gen byte shortterm_new = !ssls
replace shortterm_new = shortterm if shortterm_new == . 

drop cah 
rename cah_new cah

drop shortterm
rename shortterm_new shortterm

* save the crosswalk from pn's to the synthetic pn_new's
preserve
keep pn pn_new
isid pn
sort pn
capture noisily mkdir Intermediate_Output_Not_Exportable
save Intermediate_Output_Not_Exportable/pn2pn_new.dta, replace
restore

replace last_year = $MAXYEAR+1 if last_year==.

* what's the last year that a pn is open for this pn_new?
egen maxlastyear = max(last_year), by(pn_new)

* was this pn open in the last year for pn_new?
gen stillopen = last_year==maxlastyear

** We are picking the hospital for each pn_new that was open in the last period ** 
** And had the most beds in the last period (when there are multiple open ** 
keep if stillopen 
sort beds_tot
collapse (last) pn hrrnum hsanum zip_master name last_year, by(pn_new)

rename zip_master zip
rename pn location_pn
rename last_year location_year

* save the location data for each pn_new
* this is the file at the pn_new level
isid pn_new
sort pn_new
save Intermediate_Output_Not_Exportable/pn_new.dta, replace

